ABSTRACT

Studying genetic variations in the human genome is 
important for understanding phenotypes and 
complex traits, including rare personal variations 
and their associations with disease. The interpret- 
ation of polymorphisms requires reliable methods to 
isolate natural genetic variations, including combin- 
ations of variations, in a format suitable for down- 
stream analysis. Here, we describe a strategy 
for targeted isolation of large regions (~35kb) 
from human genomes that is also applicable 
to any genome of interest. The method relies on 
recombineering to fish out target fosmid clones 
from pools and thereby circumvents the laborious 
need to plate and screen thousands of individual 
clones. To optimize the method, a new highly 
recombineering-efficient bacterial host, including 
inducible TrfA for fosmid copy number amplifica- 
tion, was developed. Various regions were isolated 
from human embryonic stem cell lines and a 
personal genome, including highly repetitive and 
duplicated ones. The maternal and paternal alleles 
at the MECP2/IRAK1 loci were distinguished based
on identification of novel allele-specific single- 
nucleotide polymorphisms in regulatory regions. 
Additionally, we applied further recombineering to 
construct isogenic targeting vectors for patient- 
specific applications. These methods will facilitate 
work to understand the linkage between personal 
variations and disease propensity, as well as 
possibilities for personal genome surgery. 



INTRODUCTION 

Recent progress in single-nucleotide polymorphism (SNP) 
mapping, genome-wide association studies and mas- 
sively parallel sequencing is revealing the diversity of 
genetic variation within the human genome (1-5). These variations
encompass SNPs, insertions, deletions, inversions and duplications,
which can be linked with disease (1,6).
Understanding the genetic architecture of complex traits 
requires knowledge about the polymorphisms in different 
parts of the genome, including non-coding regions (6,7)
as well as information about the haplotype phasing, that is 
the combination of polymorphisms at the maternal and 
paternal alleles (8). SNPs in intergenic and intronic 
elements like enhancers have been shown to regulate 
gene expression (9,10) and to contribute to human disorders
(7,11). Recently, it was demonstrated that the
activity of long interspersed elements contributes to
inter-individual genetic variation and can be associated with
disease phenotypes (12,13).

Various methods exist for genome-wide identification of 
SNPs and structural variations (1). Recent advances in 
high-throughput DNA sequencing technologies have 
enabled rapid progress in the field (14), and in the near
future the detection of such variations in personal genomes will be
performed routinely (15,16). However, the variations lying
in duplicated and highly identical sequences are still diffi- 
cult to resolve and extensive bioinformatic analysis is 
needed to map the short next-generation sequencing 
reads in such regions (17,18). 

Although the detection of structural variations is 
very important, base pair resolution of their breakpoints 
and further functional analysis is usually required 
to define their potential impact (19,20). The existing 
target-enrichment strategies, based on polymerase chain
reaction (PCR) (21), hybridization or molecular inversion 
probes (15) merely detect variations, without isolation of 
the intact allele as a clone that can be further analyzed to 
link polymorphisms over large regions or to be genetically 
manipulated for downstream functional analysis. Allele 
linkages can be achieved using whole genome bacterial 
artificial chromosome (BAC) or fosmid DNA clone 
libraries (12,22) but the costs and time required to 
generate and map them are often not justified when only 
a specific region of the genome needs to be investigated. 

In this study, we present a simple approach, based on 
recombineering (23,24) for targeted isolation of genomic 
regions in a vector format, suitable for downstream 
analysis. Recombineering is a DNA engineering technol- 
ogy, based on homologous recombination in Escherichia 
coli, mediated by the λ phage proteins Redα/Redβ or
their functional counterparts RecE/RecT from the Rac 
prophage (23,25). We and others have shown that 
recombineering has many applications, including 
subcloning by gap repair (25), point mutagenesis in 
BACs (24), oligonucleotide directed mutagenesis (26), 
BAC engineering for gene targeting (27,28) or protein 
tagging (29-31). The high efficiency and fidelity of 
recombineering permits high-throughput DNA engineer- 
ing at genome scale (30,31). 

Here, we demonstrate an application of recombineering 
for selective isolation of large genomic fragments of choice 
from complex genomes. It circumvents the need for the 
classical method of library screening using hybridization 
to filters or individually picking and end-sequencing tens 
of thousands of clones for indexing. The method is applic- 
able to duplicated and repetitive regions and allows for 
breakpoint resolution of structural variations at single nu- 
cleotide level. The approach further allows the generation 
of isogenic targeting constructs with homology arms 
carrying the combination of SNPs characteristic for the 
source genome. Such constructs will facilitate genome en- 
gineering in embryonic stem cells (ESCs) and induced 
pluripotent stem cells (iPSCs) for disease studies. We dem- 
onstrate the utility of the approach through isolation of 
several loci from H7 and Shef4 hES cell lines and from a 
cancerous genome and their subsequent haplotype vari- 
ation characterization. 



MATERIALS AND METHODS 

Escherichia coli strains 

All the strains used in this study are derived from E. coli
DH10B. The strains GB05, GB05Red and DY380 as well
as the low-copy, temperature-sensitive pSC101γβαA
plasmid were described previously (32-34). The pSC101β
plasmid is a derivative of the pSC101γβαA plasmid and encodes
the Redβ protein instead of the RedγβαRecA operon. The
E. coli strain GB05RedTrfA was constructed by insertion
of the double operon PBAD TrfA-PRha redγβαrecA at the
ybcC locus of GB05 (33). For development of the
cassette the PRha promoter was amplified from pRedFlp
(30). The PBAD promoter of the PBAD redγβαrecA
operon was replaced with PRha by recombineering.
The PBAD TrfA was amplified from the genome of E. coli
EPI300 (Epicentre Biotechnologies, Madison, WI, USA)
and added by recombineering to the PRha redγβαrecA.

Stability test 

For the stability test a minimal BAC clone containing two
558 bp direct repeats was constructed from the pBeloBAC11
vector [New England Biolabs (NEB), Boston, MA, USA].
The repeats are part of the chloramphenicol resistance
gene (cat), which is split into two and is not functional.
The minimal BAC clone also contains neomycin/kanamycin
(neo) and zeocin (zeo) genes conferring antibiotic resistance.
For the stability assay the strains were grown
overnight at 30°C in LB supplemented with kanamycin
(km) 10 µg/ml. From the overnight culture, 10^6 cells
were inoculated in 1 ml LB containing zeo 25 µg/ml and
grown overnight at 30 or 37°C. To estimate the number of
spontaneous recombinants, the cells were plated on
LB + chloramphenicol (cm) (15 µg/ml) and LB + zeo
(15 µg/ml).
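
The plating readout of the stability assay can be summarized as a single ratio. The following is a minimal sketch, assuming that chloramphenicol-resistant colonies represent clones in which the split cat gene was restored by recombination between the direct repeats, and that zeocin-resistant colonies represent all cells carrying the BAC. The function name, the dilution convention and the colony counts are hypothetical and are not taken from the study.

```python
def recombinant_frequency(cm_colonies, cm_dilution, zeo_colonies, zeo_dilution):
    """Estimate spontaneous recombinants per BAC-carrying cell.

    cm_colonies / zeo_colonies: colonies counted on LB + cm and LB + zeo.
    cm_dilution / zeo_dilution: fold dilution of the culture that was plated
    (1 for undiluted, 1e5 for a 10^-5 dilution). Equal plated volumes are
    assumed for both plates.
    """
    if zeo_colonies <= 0:
        raise ValueError("need at least one colony on the zeo plate")
    cm_titer = cm_colonies * cm_dilution      # recombinants per plated volume
    zeo_titer = zeo_colonies * zeo_dilution   # total BAC-carrying cells
    return cm_titer / zeo_titer


# Hypothetical counts: 12 colonies on cm (undiluted), 150 on zeo (10^-5 dilution)
freq = recombinant_frequency(12, 1, 150, 1e5)
print(f"{freq:.1e} recombinants per cell")
```

Comparing this ratio between strains and between 30 and 37°C gives one number per condition, which is the form in which stability is usually compared.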

DNA isolation and shearing 

The H7 hES DNA was prepared from cells grown in our 
laboratory under standard conditions. The primary bone 
marrow sample PS-37027 is from an acute myeloid 
leukemia (AML) patient. DNA was isolated applying 
cell lysis treatment followed by phenol-chloroform extrac- 
tion, isopropanol precipitation and ethanol washing. The 
Shef4 hES DNA was kindly provided by Andrew Smith. 
The DNA was sheared using the HydroShear device 
(Digilab Genomic Solutions, MA, USA) and shearing 
assembly 4-40 kb (Zinsser Analytic, Frankfurt/Main, 
Germany) following the protocol for preparation of 
fosmid libraries (35). The sheared DNA was end-repaired 
and ethanol precipitated according to the metagenomic 
DNA isolation protocol (Epicentre Biotechnologies). 

Fosmid library construction and DNA isolation from pools 

Fosmid libraries were constructed with the pCC2Fos
copy control library kit following the manufacturer's
protocol (Epicentre Biotechnologies). The host used
for the construction of the library was E. coli
GB05RedTrfA + pSC101β. For library ligations between
0.4 and 1.8 µg end-repaired and precipitated DNA was
used (Supplementary Table S1). The titer of the library
was determined and on average 3500 clones were plated
per 15-cm culture dish containing LB agar + cm (10 µg/ml)
and tetracycline (tet, 5 µg/ml). Plates were incubated at
30°C for 18-24 h. To generate the pools, colonies from
each dish were washed off with 2 ml LB + cm + tet,
glycerol was added to 20% and 100 µl aliquots were
stored at -80°C in 96-well plates. For DNA isolation
from the pools, 25 µl aliquots were inoculated in 1 ml
LB + cm at 37°C. The fosmids were induced to high
copy overnight with 0.2% L(+)-arabinose and DNA was
isolated using 96-well filter plate A (VWR International,
Darmstadt, Germany). The DNA was combined in pools
from one row or one column of a 96-well plate for the
PCR test.
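
Row and column super-pools narrow a PCR signal down to individual wells by intersection. The sketch below illustrates that decoding step for a standard 8 x 12 plate; the function name and the well naming scheme (row letter plus two-digit column) are assumptions made for illustration.

```python
from itertools import product

ROWS = "ABCDEFGH"            # 8 rows of a 96-well plate
COLS = tuple(range(1, 13))   # 12 columns


def candidate_wells(positive_rows, positive_cols):
    """Return wells lying at the intersection of PCR-positive row and column pools."""
    rows = sorted(set(positive_rows))
    cols = sorted(set(positive_cols))
    for r in rows:
        if r not in ROWS:
            raise ValueError(f"unknown row: {r!r}")
    for c in cols:
        if c not in COLS:
            raise ValueError(f"unknown column: {c!r}")
    return [f"{r}{c:02d}" for r, c in product(rows, cols)]


# Hypothetical result: row pools C and F and column pool 2 were PCR positive
wells = candidate_wells({"C", "F"}, {2})
print(wells)
```

When several rows and several columns are positive at once, the intersection contains more candidates than true positives, so each candidate well would still need to be confirmed individually.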
PCR pre-screen of the library

The PCR primers for pre-screening of the library 
(Supplementary Table S2) were designed using the 
Primer3 tool (http://frodo.wi.mit.edu/primer3/). The 
oligos were chosen to be in close proximity to the site of 
cassette insertion. Their sensitivity was tested with 
Ensembl BlastN search tool with search-sensitivity of 
near-exact matches and in silico PCR (http://genome 
.ucsc.edu/cgi-bin/hgPcr). For the PCR template, 
50-100 ng DNA from each plate row or plate column 
was used. PCR amplification was performed using 
Eppendorf Mastercycler CP 534X. Thermal cycling parameters
for Taq DNA polymerase (5 PRIME, Hamburg,
Germany) were 95°C for 4 min followed by 35 cycles of
95°C for 15 s, annealing for 15 s (temperature indicated in
Supplementary Table S2) and extension at 68°C for 15 s
with a final extension of 10 min at 68°C. All the oligos in
this study were purchased from Biomers (http://www 
.biomers.net/de.html). 

Recombinogenic cassette design and modification 

Homology arms (HAs) for the capturing cassettes
(Supplementary Table S3) were designed according to
Ensembl (http://www.ensembl.org/index.html) genome
version GRCh37 release 54-58. The cassettes were
generated by PCR using the blasticidin resistance gene
(bsd) and oligonucleotides that contain the flanking
50 bp homology regions. The bsd selectable marker was
amplified from the genomic ara-leu locus of strain GB05
(previously recombined with this cassette) to prevent
background recombination. The cassettes were
phosphorylated at one 5'-end but not at the other 5'-end
to generate PO or OP cassettes, where O means hydroxyl
(36). The cassettes were purified from the PCR reaction
using the MSB Spin PCRapace kit (Invitek, Berlin,
Germany). The cassette for testing the recombineering
efficiency of the E. coli strains was also phosphorylated
at one of the 5'-ends. In addition, two phosphorothioate
linkages (S) were inserted in the first and second bond at
the other 5'-end (PS cassette) (36).
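
Oligonucleotides that attach 50 bp homology arms to a selectable marker by PCR can be assembled programmatically. The following is a minimal sketch of that assembly, assuming the locus sequence is given as the top strand and that the two priming sequences anneal to the ends of the bsd template. The function names and the short priming sequences in the example are hypothetical placeholders, not the oligos listed in Supplementary Table S3.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def capture_oligos(locus, insert_pos, fwd_prime, rev_prime, arm_len=50):
    """Build forward and reverse oligos (5'->3') for a capturing cassette.

    locus      : top-strand genomic sequence around the insertion site
    insert_pos : 0-based index; the cassette is inserted before this base
    fwd_prime  : 3' part of the forward oligo, annealing to the cassette start
    rev_prime  : 3' part of the reverse oligo, annealing to the cassette end
    """
    if insert_pos < arm_len or insert_pos + arm_len > len(locus):
        raise ValueError("not enough flanking sequence for the homology arms")
    left_arm = locus[insert_pos - arm_len:insert_pos]
    right_arm = locus[insert_pos:insert_pos + arm_len]
    forward = left_arm + fwd_prime
    # The reverse oligo reads the bottom strand, so the right arm is reverse-complemented
    reverse = reverse_complement(right_arm) + rev_prime
    return forward, reverse


# Toy locus and placeholder priming sequences, for illustration only
locus = "A" * 50 + "C" * 50
fwd, rev = capture_oligos(locus, 50, "GGG", "TTT")
print(len(fwd), len(rev))
```

The PCR product made with such a pair carries the left arm, the cassette and the right arm in top-strand order, which is the arrangement needed for insertion at the chosen site.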

Recombineering protocol 

To screen the library by recombineering, aliquots (25 µl)
from the PCR-positive pools were grown in 1 ml LB supplemented
with tet (5 µg/ml) and cm (10 µg/ml) overnight
at 30°C. The overnight culture was diluted 1/50 and grown
in 25 ml at 30°C for 2 h, followed by addition of L(+)-arabinose
(Sigma A-3256) and L(+)-rhamnose (Sigma
R3875) to 0.2% and growth for 45 min at 37°C. The
cells were centrifuged, transferred to an Eppendorf tube
and washed twice with 1 ml of ice-cold 10% glycerol,
followed by resuspension in 80 µl. About 600 ng cassette
was added to 40 µl competent cells. For each electroporation,
a pre-chilled 1 mm electroporation cuvette (BTX,
Harvard Apparatus) was used at settings 1350 V, 10 µF,
600 Ω (Eppendorf Electroporator 2510). After electroporation
the cells were resuspended in 1 ml SOC medium and
incubated for 1 h at 37°C before plating on low-salt LB
agar supplemented with 40 µg/ml blasticidin S (BSD)
(InvivoGen, San Diego, CA, USA). The plates were
incubated at 37°C for 18-24 h.

Characterization of the isolated recombinant fosmids 

Between 1 and 16 clones per captured region were
inoculated in 1 ml low-salt LB supplemented with BSD
40 µg/ml and grown overnight at 37°C; then 30 µl were
inoculated in 0.5 ml TB supplemented with BSD and
grown overnight at 37°C. Glycerol was added to the
remaining culture to 20%, which was then stored at -80°C.
Fosmid DNA was
isolated using Invisorb Spin Plasmid Mini Two (Invitek,
Berlin, Germany) or 96-well filter plate A (VWR
International). The clones were end-sequenced with
pCC2Fos vector primers. Around 0.7 µg DNA was used
for the restriction digestion experiments in a 40-µl reaction
volume. All enzymes were supplied by NEB.

Next-generation sequencing parameters and 
bioinformatic analysis 

Fosmid DNA was mixed in five pools at a final concentration
of ~3.5 µg/6 µl so that overlapping clones were kept
in different pools. The DNA was sheared using the
Covaris S2 (Covaris, Inc., MA, USA) to
an average fragment size of 200 bp. The fragmented pools
of DNA were indexed and a standard multiplex
sequencing library for the Illumina platform was prepared
(NEB, NEBNext® DNA Sample Preparation). After
flow cell generation on the cBot (Illumina) standard
single-read sequencing (51 bases) was performed on the
HiSeq 2000 platform (Illumina). A total of 1.2 × 10^8 reads
were obtained, of which 75% were mappable. Mapping
was done with Bowtie (version 0.12.7 64-bit) against the
UCSC GRCh37/hg19 human genome assembly. Initial
SNP calling was carried out with samtools and subsequently
custom software was written and used for the
SNP analysis. The latest snp132 database was used to
annotate the variations and bambino and IGV 1.5
(Broad Institute) software was used to identify the
genomic regions for polymorphisms.

Isogenic targeting constructs generation 

All constructs were designed in silico using
Gene Construction Kit (TEXTCO BioSoftware). The
recombineering experiments were performed in the library
host GB05RedTrfA, which had lost the temperature-sensitive
pSC101β plasmid by culture at 37°C. The
recombineering protocol was the same as described for
screening the libraries but in the subsequent steps the induction
was only with L(+)-rhamnose. The capturing cassettes
contain 40 bp sequences flanking the bsd gene that serve
as homology arms for sequential recombineering with the
reporter cassette lacZneo (sA-T2A-LacZ-T2A-Neo-pA-loxP).
The rest of the cassettes for generation of conditional
knockout targeting constructs were designed as
already published (33). The oligos for attachment of
homology arms by PCR to the capturing cassette, the
subcloning vector p15A-pTK-DTA-ampR and the downstream
cassette rox-BSD-PGK-rox-loxP are given in
Supplementary Table S4.
RESULTS

Generation of recombineering proficient host for fosmid 
library construction 

Our goal was to develop an assay that can capture by
recombineering large regions of interest from human
genomes in a fosmid clone format suitable for sequencing
and genetic engineering. We generated a new fosmid
library host (GB05RedTrfA) (Figure 1), which carries in
its genome the γβαRecA recombineering operon (32)
under the rhamnose-inducible promoter (Prha) (37) as
well as the TrfA gene (38) under the arabinose-inducible
promoter (Pbad) (39). The TrfA protein is required
for initiation of replication from the bidirectional
origin oriV and the subsequent increase in fosmid copy
number. The strain is highly stable (Supplementary
Figure S1) with rates of spontaneous rearrangements
in the absence of induction comparable with the previously
published recombineering-proficient hosts
GB05(BAD)Red (33) or DY380 (34). We optimized the
recombineering conditions using a blasticidin resistance
cassette insertion assay into a single fosmid clone
(Figure 1A). One of the strands of the dsDNA cassette
was phosphorylated at the 5'-end and phosphorothioate
linkages were added to the 5'-end of the other strand, to
facilitate the enzymatic conversion to ssDNA in vivo,
which improves the recombineering frequencies (36).
We tested whether the recombineering efficiencies can be
further promoted by the helper plasmids pSC101β or
pSC101γβαA (32), in which the recombineering genes
are also under Pbad control. The additional transient
expression of the strand annealing protein Redβ alone from
the helper plasmid pSC101β increased the frequency of
recombination almost twice as much as the additional
complete recombineering operon from pSC101γβαA
(Figure 1B), indicating that overexpression of some of
the other proteins in the operon may be detrimental to
the overall efficiency.

A more than 3-fold increase in the number of recombinants
was observed after high-copy fosmid induction in
GB05RedTrfA in comparison with the GB05Red strain
where oriV cannot be induced (Figure 1B). Using
GB05RedTrfA + pSC101β and transient high-copy fosmid
replication induction, we achieved up to 6.8 × 10^3 recombinants
per million viable cells after transformation, an
efficiency which allows for recombineering-mediated targeting
of a specific clone in a complex fosmid library.

Figure 1. Fosmid library host optimization. (A) Recombineering assay with GB05RedTrfA + pSC101β. The strain carries in the genome the modified
red operon (32) (gam, beta, exo, recA) (red) under the control of the rhamnose-inducible Rha promoter and has the TrfA gene (blue), which
promotes high-copy fosmid replication under the control of the arabinose-inducible BAD promoter. The BAD promoter also drives expression of the
Redβ protein (red) located on the helper plasmid pSC101β. A random fosmid clone was chosen for the insertion of a modified blasticidin (bsd) cassette
via recombineering. The cassette is flanked with 50 bp homology arms, identical to the region of choice (green). After the selection step the
temperature-sensitive plasmid pSC101β is lost at 37°C. OriV: bidirectional origin of replication; Fori: unidirectional origin of replication. (B)
Strain history. All strains were derived from DH10B. The strains GB05, GB05Red and DY380 as well as the pSC101γβαA plasmid were described
previously (32-34). The BADTrfA and RhaγβαA cassettes in GB05RedTrfA were inserted at the ybcC locus, as was the BADγβαA cassette in GB05Red. (C)
Comparison of the recombineering efficiencies of the host strains. To test recombineering efficiency, a modified blasticidin cassette was inserted in a
randomly selected fosmid from the H7 library. The number of recombinants was normalized to the number of cells surviving electroporation. Higher
recombineering frequencies were obtained using the pSC101β helper plasmid. High-copy fosmid induction further promoted recombineering
efficiency.

Figure 2. Recombineering strategy for fishing out genomic regions. (A) Fosmid library preparation. High molecular weight DNA was isolated from
an hES cell line or patient tissue sample. The DNA was sheared to ~40 kb fragments. The fragmented DNA was ligated to pCC2Fos (dark grey lines) as
concatamers and packaged in λ phage particles. The DNA was transduced into the E. coli host strain GB05RedTrfA + pSC101β. (B) Screening of the
library by recombineering. On average 3500 cfu were plated per petri dish and then collected as a pool to a single well of a 96-well plate; the wells were
pre-screened in super pools of rows and columns by PCR. Positive wells were cultured to induce the fosmids to high copy and express the Red
proteins for recombineering before electroporation with a bsd cassette. Recombination into the fosmid of choice conveyed blasticidin resistance after plating.

Targeted isolation of genomic regions by recombineering 

The general outline of our approach is shown in the 
flowchart of Figure 2. First, a fosmid library is con- 
structed from mechanically sheared genomic DNA 
(Figure 2A). Next, the library is split into pools of about 
3500 clones, which are then screened by PCR. Finally, the 
target clones are fished out by recombineering through the 
insertion of a modified blasticidin cassette flanked by 
50-bp long homology arms (Figure 2B). 

We optimized the method using genomic DNA isolated
from the H7 human embryonic stem (hES) cell line (40).
Based on the recombineering efficiencies determined with
single fosmids (6.8 × 10^3 recombinants/10^6 cells) and given
that the number of surviving cells in a typical
recombineering reaction in the absence of selection is
about 10^9 cells/ml, we estimated that the recombineering
efficiency of the new host should allow us to isolate 10-100
recombinants of a specific clone in a mixture of 10^4 clones.
In a pilot experiment, a defined fosmid was added to pools
of different complexities, which showed that optimal
performance was achieved with pools of 3.5 × 10^3
fosmids (data not shown). At that complexity, a library
of over 3-fold coverage of the haploid human genome can
fit in a single 96-well plate, and any region of interest can
be isolated within 2 days, saving time and effort involved
in screening entire libraries.
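
The coverage figure for one 96-well plate can be checked with simple arithmetic. The sketch below assumes an average insert of 35 kb (the approximate region size given in the abstract) and a haploid genome of 3.1 × 10^9 bp; both values, and the function names, are assumptions for illustration. It also applies the standard Clarke-Carbon estimate of the chance that a given site is present in the library, which is a textbook formula and not a calculation reported in this study.

```python
def library_coverage(n_pools, clones_per_pool, insert_bp, genome_bp):
    """Fold coverage of the haploid genome by all clones in the pooled library."""
    return n_pools * clones_per_pool * insert_bp / genome_bp


def prob_site_present(n_clones, insert_bp, genome_bp):
    """Clarke-Carbon estimate: P = 1 - (1 - f)^N, with f = insert / genome size."""
    f = insert_bp / genome_bp
    return 1.0 - (1.0 - f) ** n_clones


POOLS = 96             # one pool per well
CLONES_PER_POOL = 3500
INSERT_BP = 35_000     # assumed average insert size
GENOME_BP = 3.1e9      # assumed haploid human genome size

coverage = library_coverage(POOLS, CLONES_PER_POOL, INSERT_BP, GENOME_BP)
p_site = prob_site_present(POOLS * CLONES_PER_POOL, INSERT_BP, GENOME_BP)
print(f"coverage: {coverage:.1f}-fold, P(site present): {p_site:.3f}")
```

Under these assumptions one plate holds 336 000 clones and gives a little under 4-fold coverage, consistent with the "over 3-fold" statement. For a diploid sample the estimate applies to each allele separately, so recovering both alleles of a locus is less likely than recovering either one.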

Application of the method to retrieve various regions 

We applied the approach to capture the OCT4 locus from
the H7 hES cell line. After recombineering, blasticidin-resistant
colonies were obtained from five PCR-positive
pools from two independent libraries (Supplementary
Table S5). End sequencing from the vector and restriction
analysis established that the captured fosmids covered the
OCT4 locus and surrounding regions (Figure 3A;
Supplementary Figure S2 and Supplementary Table S5).

Five further regions were retrieved from the H7 hES
cells. For the adenosine kinase (AK), methyl CpG
binding protein 2 (MECP2) and paired box 6
(PAX6) transcription factor genes we isolated the genomic
regions required for isogenic targeting construct generation
(Figure 3B-D). The entire MYCN and NANOG
genes and their surrounding regions were also successfully
captured (Figure 3E and F). NANOG has several pseudogenes
and one of them, NANOGP1, arose through local
duplication of the NANOG gene (41). In order to isolate
the gene, 100 bp of homology sequence unique to the
NANOG locus was chosen. The captured fosmid covers
the whole locus, an intergenic region and part of the
neighboring gene, which is also duplicated (41). Large
parts of the 36 kb genomic fragment contain repeats,
of which 66% belong to different classes of Alu
elements. Restriction analysis confirmed that the highly
repetitive fosmids were not rearranged (Supplementary
Figure S2 and Supplementary Table S5).

In further exercises, we used the male hES cell line Shef4 
(42) and a primary leukemic sample. With the available 
cassettes, we isolated Shef4 MECP2, OCT4, PAX6 and 
GATA4 regions (Supplementary Table S5). For the 
leukemic sample, we focused on potential disease-related 
regions of chromosome 2 and isolated two independent 
clones for each of the regions of interest (TP53I3, 
ASXL2 and MYCNOS loci). 

All target regions from both hES cell lines and the personal
genome were captured successfully (Supplementary
Table S5). As with other recombineering applications, we 
have not found any sequence limitation in the choice of 
homology arms except for the need to avoid repeats. 
Figure 3. Genomic regions of interest isolated from the H7 hES cell line. Depicted are the fished out clones (red), the capturing cassettes (black
triangles) and the characteristics of the genomic region showing exons (thick yellow lines), introns (thin yellow lines), repetitive elements (all repeats)
and G+C content (%GC). (A) Clones containing the OCT4 locus. (B) The AK locus with the capturing cassette inserted in front of exon 5.
(C) Methyl CpG binding protein 2 locus (MECP2) and two different probes used for fishing out regions of interest. (D) The 3'-end of the PAX6
locus. (E) Isolation of the NANOG locus from the highly similar NANOGP1 pseudogene (depicted in lilac). (F) Clone covering the MYCN gene
and the probe used, which has 69% G+C content.



Hence the approach appears to be applicable to a diverse 
spectrum of genomic regions. No incorrect insertions were 
observed and the restriction digest analysis showed a very 
low number of rearranged clones. The number of recom- 
binants varied for each of the targeted regions but was 
within the expected range (1-728 recombinants per 
reaction). Addition of more than 500 ng of the cassette 
did not increase the number of recombinants 
(Supplementary Figure S3). 

We used single-strand DNA recombineering as it
provides higher efficiency and fidelity (36). Either strand
can be used, but the strand annealing to the lagging strand
of the replication fork is favored by the recombineering
reaction (43). In our experiments, the efficiencies between
the two strands varied several fold (Supplementary
Table S6), indicating that testing both strands can be
beneficial for the isolation of difficult regions.

Haplotype phasing and identification of allelic differences 
in H7 loci 

Regions from the H7 cell line for which more than one
fosmid was fished out (Figure 3) were sequenced with
Illumina in order to reconstruct the haplotype phase of
the genomic regions. Indexed libraries containing the
overlapping clones were sequenced to a mean depth of
11071 reads per base pair. Bioinformatic analyses
indicated two positions on chromosome X with potential
allelic differences that were supported by a similar number
of unique reads between the overlapping clones
(Supplementary Table S7). These include differences at the
MECP2/IRAK1 loci that are not annotated in the SNP132
database. The observed allelic polymorphisms are G/A
at the 3'-UTR of MECP2 and C/G at the promoter
region of IRAK1 located 5325 bp downstream on the
same allele (Figure 4). Both SNPs are in CpG
dinucleotides and are located in regulatory regions: a
DNaseI hypersensitive site in the 3'-UTR of MECP2
and the CpG island upstream of IRAK1 (UCSC Genome
Browser, GRCh37/hg19). The SNP at the 3'-UTR of
MECP2 was validated by PCR and sequencing (data not
shown). The second SNP is located in an extremely GC-rich
region and we failed to amplify it by PCR with several
sets of primers.

Figure 4. Allele-specific SNPs at the MECP2/IRAK1 locus on the X chromosome in H7 hES cells. In the upper part of the diagram the distribution
of uniquely mappable Illumina reads from fosmid clones H7-F and H7-C02 is shown as grey lines. Gaps indicate repetitive sequence and the average
corresponds to 11 000 reads. The colored lines indicate the SNP positions in comparison with the human reference genome as follows: green-A;
one newly identified indel, which is present on both alleles. One of the alleles has A instead of G at position 153 290 956 and G instead of C at
position 153 285 631. The SNPs are in DNaseI hypersensitive site (DHS) and in CpG island, respectively as indicated. The indel indicates an insertion
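
Because each fosmid carries a single allele, allele-specific positions can be found by comparing the consensus base calls of two overlapping clones. The following is a minimal sketch of that comparison, not the custom software used in this study. The clone labels and the third, concordant position are hypothetical; the two discordant positions and their bases follow the values given for the MECP2/IRAK1 locus.

```python
def allele_specific_sites(calls_a, calls_b):
    """Return (position, base_a, base_b) for positions covered by both clones
    where the consensus bases differ, sorted by position."""
    shared = calls_a.keys() & calls_b.keys()
    return sorted(
        (pos, calls_a[pos], calls_b[pos])
        for pos in shared
        if calls_a[pos] != calls_b[pos]
    )


# Consensus calls per clone: position (GRCh37, chromosome X) -> base
clone_1 = {153285631: "C", 153288000: "T", 153290956: "G"}  # reference-like allele
clone_2 = {153285631: "G", 153288000: "T", 153290956: "A"}  # alternative allele

sites = allele_specific_sites(clone_1, clone_2)
distance = sites[-1][0] - sites[0][0]
print(sites, distance)
```

The distance between the two discordant positions is 153 290 956 - 153 285 631 = 5325 bp, matching the separation stated in the text. Both alternative bases come from the same clone, which is what places them in phase on one allele.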

In addition to the allele-specific SNPs we reconstructed 
the combination of SNPs across the sequenced regions of 
chromosomes X, 6 and 10 (Supplementary Table S7).
As expected more SNPs were found in the highly poly- 
morphic region of chromosome 6 than at the other loci. In 
addition several non-synonymous mutations in CCHCR1 
and TCF19 genes and small-scale indels were scattered 
across the 35 kb genomic region from chromosome 6 
(data not shown). The indels for the OCT4 loci from the 
H7 and Shef4 cell lines were validated by PCR and
sequencing (Supplementary Table S8). 

Generation of isogenic targeting constructs 

We used the retrieved fosmids to generate allele-specific
targeting constructs for MECP2, AK and OCT4 by the
following method. The blasticidin cassette used for fishing
from the pools was designed to contain additional 40 bp
homology regions to the lacZneo stop cassette
(Figure 5A). After isolation of the isogenic clones, the
blasticidin cassettes were replaced by recombineering
with a lacZneo stop cassette that is flanked by the same
40 bp homology arms (Figure 5B). For MECP2, the
blasticidin cassette was targeted to the intron upstream
of exon 4, which was selected because its later removal
by Cre recombinase will cause a frame shift in the
mRNA. Subcloning in a p15A-origin vector and
addition of a 3' loxP site after the frame-shifting exon
were done following the established pipeline for conditional
targeting construct generation (Figure 5C and D) (33).
All recombineering steps after clone isolation were
mediated by the rhamnose-inducible redγβαRecA operon
present in the genome of GB05RedTrfA. The expected
products were validated by restriction mapping and
sequencing of the recombineering junctions. They have
been successfully used for targeting in H7 hES cells
(data not shown).



DISCUSSION 

Studying genetic variations in the human genome is im- 
portant for the understanding of phenotypes, diseases, 
drug responsiveness and the mechanisms of complex 
traits (6). For many applications, only a small part of 
the genome, such as specific genes or regulatory regions, 
is of interest (44,45). The current methods for selected
enrichment of genomic regions followed by next gener- 
ation sequencing are based on PCR or hybridization 
approaches (15). These methods encounter size limitations 
particularly when linking variations separated by more than a
few hundred base pairs, as well as limitations in duplicated 
and repetitive regions. 

The recombineering strategy presented here is useful for 
targeted isolation of genomic regions in a vector format 
that allows for rapid adaptation to functional analysis 
based on gene targeting (27,28) or transgenesis (30). A 
similar approach to isolate genomic regions in BACs has 
been published recently (46). We use fosmids, because they 
are easy to handle, stable, suitable for genomic structural 
variation studies (2,5,22) and preparation of targeting 
constructs. Most importantly, compared to BAC libraries, 
fosmid library construction requires much less genomic 
DNA, which is a major consideration when the source 
of DNA is a patient sample. 
Figure 5. Workflow for the generation of isogenic conditional targeting constructs. All recombineering steps after the first one were performed in
GB05RedTrfA after rhamnose induction of the RedγβαA operon. All the genes conveying antibiotic resistance have prokaryotic promoters (data not
side of the frameshifting exon using blasticidin selection. The PGK-BSD gene is flanked by rox sites for later removal by Dre recombinase.



To increase the targeting efficiency and thereby the 
complexity of the pools from which a specific region can 
be retrieved, we engineered a new strain that allows for 
switching from unidirectional to bidirectional fosmid rep- 
lication. In that way, we exploit an additional increase in 
recombineering efficiency due to increased fosmid 
copy-number after TrfA induction. This improved the iso- 
lation of genomic regions of choice from complex fosmid 
pools. The very low levels of illegitimate recombination 
reduced the need to screen through a large number of 
clones to obtain the desired region. The number of recom- 
binants varied between the captured loci, possibly reflect- 
ing the different replication speeds of the individual clones 
within the pools. Variability in the number of recombin- 
ants for several E. coli chromosomal locations has previ- 
ously been correlated with the rate of replication of the 
regions (26). 

Previously, a method to screen genomic libraries by 
recombineering was reported (47). However, this method 
does not appear to have been subsequently utilized, 
possibly because the complex counter-selection strategy 
imposed practical difficulties. Similarly, our previous 
experience with genomic cloning by recombineering (25) 
indicated certain practical limits to lambda Red recombination 
in complex backgrounds. Hence, we adapted a 
recombineering method to optimally sized pools of 
cloned genomic regions. 

Fine-tuning the expression levels of the recombineering 
proteins not only improved the recovery of 
target clones but also likely contributed to the successful 
isolation of intact, highly repetitive regions. Indeed, 
previous work has shown that overexpression of Redγ 
from a plasmid can increase the total number of 
colonies, but the frequency of correct recombinant 
BACs was low (48). Transient RecA co-expression from 
a plasmid has previously been shown to enhance the total 
number of colonies surviving electroporation (32), but 
leaky expression of RecA could increase the basal 
level of unintended intramolecular rearrangements. 
We therefore expressed RecA from the genome, 
together with the Red operon, using the tightly controlled 
PRha promoter. 









The extent of variation within human genomes is now 
being revealed by SNP maps and massively parallel 
sequencing (1-4). However, knowledge about 'haplotype 
phasing' in different genomes has been scarce (8). 
Two recently published methods for genome-wide resolution 
of haplotypes (49,50) pave the way to systematically 
study haplotype phasing in individual genomes 
and cell lines. Our approach is complementary to these 
studies and allows for the determination of SNP linkage, 
and therefore of disease susceptibility, throughout the 
selected regions covered by fosmid clones. In this way, we 
reconstructed haplotypes at loci on chromosomes 6, 10 
and X in the H7 hES cell line. Comparative analysis 
of the H7 and Shef4 OCT4 haplotypes revealed differences 
at 12 SNP positions, and most of the identified 
indels were cell line specific (13 of 16). These variations 
were found in more than one independent clone and therefore 
represent true polymorphisms of the cell lines. 
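
The criterion used above, that a variation is accepted only when it is seen in more than one independent clone, can be illustrated with a minimal sketch. The clone names, positions and alleles below are hypothetical and serve only as an example; this is not the analysis pipeline used in this study.

```python
from collections import Counter

def confirmed_variants(clone_calls, min_clones=2):
    """Return variants observed in at least `min_clones` independent clones."""
    counts = Counter()
    for variants in clone_calls.values():
        # Count each variant once per clone, even if it was listed twice.
        counts.update(set(variants))
    return {variant for variant, n in counts.items() if n >= min_clones}

# Hypothetical variant calls per clone: (position, reference, alternative).
clone_calls = {
    "clone_A": [(101, "A", "G"), (250, "C", "T")],
    "clone_B": [(101, "A", "G"), (400, "G", "GA")],
    "clone_C": [(250, "C", "T"), (101, "A", "G")],
}

confirmed = confirmed_variants(clone_calls)
```

In this example the indel at position 400 is supported by a single clone and is therefore not retained, whereas the two SNPs are each supported by at least two clones.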

Whole-genome sequencing shows that structural variations 
smaller than 50 kb account for a large portion 
of the polymorphism identified in individual human 
genomes (1,5). Most of these events are enriched near or 
in repeated and segmentally duplicated regions, and 
difficulties in resolving them have been reported by different 
investigators (5,17). Using the targeted retrieval of clones, 
we were able to distinguish between highly similar sequences 
such as NANOG and its pseudogene NANOGP1. 
Once isolated, such regions can be further characterized 
by sequencing at very high depth. This allows the description 
of their polymorphisms at single-nucleotide 
resolution. 

The impact of mutations can be explored, and the mutations 
characterized as benign or disease associated, 
through gene targeting in stem cells (51,52) 
with isogenic constructs. Our approach permits the generation 
of such constructs with personal-genome-specific combinations 
of variations. The isogenicity of the flanking homologous 
sequences is an important issue. First, it could 
promote the targeting efficiency in human ES cells, as 
was shown for mouse ES cells (48,53). Second, bearing 
in mind that SNPs may influence transcription factor 
binding and gene expression (9,10), targeting with 
isogenic vectors should not disturb the existing genomic 
context. This will be useful for gene editing in stem 
cell-based therapies. 

We identified two novel allele-specific SNPs located in 
regulatory regions on one of the X chromosomes in the H7 
cell line at the MECP2/IRAK1 loci. The biological significance 
of these polymorphisms is not known. The 
whole-genome ENCODE analysis of the male H1 hES 
cell line indicates that the two SNPs are located in an 
enhancer and a promoter where c-Myc and Pol2 bind, 
respectively. The SNPs are in CpG dinucleotides; thus, 
they may influence the binding of regulatory proteins or 
the methylation status of the two alleles. 
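
Whether a SNP position falls within a CpG dinucleotide can be checked directly on the reference sequence. The following is a minimal sketch; the function name and the toy sequence are hypothetical, and positions are assumed to be 0-based. Because CG is its own reverse complement, inspecting the forward strand is sufficient.

```python
def in_cpg(seq, pos):
    """Return True if the base at 0-based `pos` is part of a CG dinucleotide."""
    seq = seq.upper()
    base = seq[pos]
    if base == "C":
        # The C of a CpG is followed by a G.
        return pos + 1 < len(seq) and seq[pos + 1] == "G"
    if base == "G":
        # The G of a CpG is preceded by a C.
        return pos > 0 and seq[pos - 1] == "C"
    return False

# Toy sequence for illustration only.
toy = "ATCGTA"
c_in_cpg = in_cpg(toy, 2)   # C followed by G
a_in_cpg = in_cpg(toy, 0)   # A is never part of a CpG
```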

The high fidelity of Red/ET recombineering 
demonstrated in this and previous studies allows the 
further scale-up of the method to a high-throughput liquid 
format (30,31) for simultaneous isolation of multiple loci. 
For example, the method can be used to develop screening 
assays for isolation of regions affected by mobilized 
retrotransposons or other repetitive elements in personal 
genomes. Recently, numerous novel active retrotransposons 
were identified in the human genome (12,13). 
Although they are underrepresented in the reference 
sequence, they exist at low allele frequencies in the population 
and can be a source of disease-producing 
insertions. 

This method can also simplify the acquisition of DNA 
regions from model organisms or metagenomic studies of 
environmental samples. The approach is straightforward 
and does not require any special equipment or 
complicated computational analysis. Because it is flexible 
with many potential applications, we recommend it to a 
wide range of researchers. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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